2024-04-18
Support Vector Machines (SVMs) are powerful supervised learning models used for both classification and regression tasks. By constructing a hyperplane in a high-dimensional space, SVMs achieve class separation by maximizing the margin between the closest data points of each class. This approach not only enhances the model’s accuracy but also its predictive reliability across datasets with numerous features or clear separations between classes.
- Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features.
Image A Shows a clear linear separation, demonstrating the SVM hyperplane in a scenario where data is linearly separable.
Image B Illustrates data that is not linearly separable in its original space but can be tackled by SVM through the use of a kernel function.
Support Vector Machines (SVMs) serve as a robust methodology for binary classification by creating a hyperplane which acts as a decision boundary between two classes. This hyperplane is determined mathematically by the equation \(w^T x + b = 0\), where \(w\) is the weight vector perpendicular to the hyperplane, and \(b\) is the bias, shifting the hyperplane away from the origin.
Kernel Function in SVM
In Support Vector Machines (SVM), the kernel function plays a crucial role in transforming the input feature space into a higher-dimensional space where the data can be linearly separated. This is particularly useful in cases where the data is not linearly separable in its original space. The kernel function computes the dot product between the feature vectors in this higher-dimensional space without explicitly mapping the vectors into that space, which is known as the “kernel trick.”
Common types of kernel functions include:
Linear Kernel: \(K(x_i, x_j) = x_i^T x_j\). This is the simplest form of the kernel, used when the data is linearly separable.
Polynomial Kernel: \(K(x_i, x_j) = (1 + x_i^T x_j)^d\). This kernel maps the input features into a polynomial feature space, allowing for polynomial decision boundaries.
Radial Basis Function (RBF) Kernel: \(K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}\). Also known as the Gaussian kernel, it maps the features into an infinite-dimensional space, providing a lot of flexibility for non-linear decision boundaries.
Each kernel function has its own set of parameters that need to be tuned for optimal performance. The choice of kernel function and its parameters can significantly impact the SVM model’s ability to capture the underlying patterns in the data.
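As an illustrative sketch (not the project's R code), the three kernels above can be written as plain functions; the example vectors and \(\gamma = 0.5\) below are arbitrary choices:

```python
import math

def linear_kernel(x, z):
    # K(x, z) = x . z
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, d=2):
    # K(x, z) = (1 + x . z)^d
    return (1 + linear_kernel(x, z)) ** d

def rbf_kernel(x, z, gamma=0.5):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [2.0, 0.0]
print(linear_kernel(x, z))       # 2.0
print(polynomial_kernel(x, z))   # 9.0
print(rbf_kernel(x, x))          # 1.0 -- a point is maximally similar to itself
```

Note that the RBF kernel always returns a value in (0, 1], shrinking as the two points move apart; \(\gamma\) controls how quickly the similarity decays.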
The objective function that SVM optimizes is a combination of maximizing the margin and minimizing the classification error. This is achieved through the minimization of the following objective function: \[\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \xi_i\] Subject to the constraints: \[y_i (w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i\] where \(w\) is the weight vector, \(b\) is the bias term, \(C\) is the regularization parameter, \(\xi_i\) are the slack variables representing the degree of misclassification of the \(i\)-th data point, and \(y_i\) are the class labels.
Hinge Loss:
The hinge loss function is used in SVM to penalize misclassifications. It is defined as: Hinge loss = \(\max(0, 1 - y_i (w^T x_i + b))\) The hinge loss is zero for correctly classified points that are outside the margin, and it increases linearly for points that are on the wrong side of the hyperplane or within the margin.
The optimization of the objective function involves finding the values of \(w\) and \(b\) that minimize the function, subject to the constraints. This is typically done using quadratic programming techniques.
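A minimal sketch of the hinge loss and the regularized objective above, in Python for illustration (the weight vector, bias, and data points below are arbitrary example values, not a solved SVM):

```python
def hinge_loss(w, b, x, y):
    # max(0, 1 - y * (w . x + b)); labels y are +1 / -1
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def objective(w, b, X, Y, C=1.0):
    # (1/2)||w||^2 + C * sum of hinge (slack) terms
    reg = 0.5 * sum(wi * wi for wi in w)
    return reg + C * sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))

w, b = [1.0, -1.0], 0.0
print(hinge_loss(w, b, [3.0, 0.0], +1))   # 0.0  (correct, outside the margin)
print(hinge_loss(w, b, [0.5, 0.0], +1))   # 0.5  (correct, but inside the margin)
print(hinge_loss(w, b, [-1.0, 0.0], +1))  # 2.0  (misclassified)
print(objective(w, b, [[3.0, 0.0], [0.5, 0.0]], [+1, +1]))  # 1.5
```

The three calls reproduce the three regimes described above: zero loss outside the margin, a linear penalty inside it, and a larger penalty on the wrong side of the hyperplane.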
The main objective of this project is to utilize the Support Vector Machine (SVM) algorithm to effectively predict customer churn at ABC Multinational Bank. This predictive model aims to identify key indicators that signal the likelihood of customers opting to leave the bank. By understanding these indicators, the bank can deploy targeted interventions to improve customer satisfaction and retention. Ultimately, this effort will enable ABC Multinational Bank to take proactive measures in retaining valuable customers, thereby stabilizing their customer base and enhancing long-term business sustainability.
The dataset utilized in this project was sourced from Kaggle, a platform known for providing a wide range of high-quality datasets. The attributes of the dataset are:
| Column Name | Description |
|---|---|
| customer_id | A unique identifier for each customer, not used in the analysis. |
| credit_score | A numerical representation of the customer's creditworthiness. |
| country | The country in which the customer resides. |
| gender | The gender of the customer (e.g., male, female). |
| age | The age of the customer in years. |
| tenure | The number of years the customer has been with the bank. |
| balance | The current balance in the customer's account. |
| products_number | The number of products the customer has with the bank. |
| credit_card | Indicates whether the customer has a credit card with the bank. |
| active_member | Indicates whether the customer is an active member. |
| estimated_salary | The estimated annual salary of the customer. |
| churn | The target variable, indicating customer churn (1 for churned, 0 for not churned). |
This dataset consists of 14 columns and 165034 rows.
```
[1] 165034 14

id CustomerId Surname CreditScore Geography
 0          0       0           0         0
Gender Age Tenure Balance NumOfProducts
     0   0      0       0             0
HasCrCard IsActiveMember EstimatedSalary Exited
        0              0               0      0
```
There are no null values in the dataset.
```
Geography Gender HasCrCard IsActiveMember Exited
        3      2         2              2      2
```
Geography: The Geography column has 3 unique values: France, Germany, Spain.
Gender: The Gender column has 2 unique values: Male, Female.
IsActiveMember: This column consists of 2 unique values, yes and no, indicating whether the customer is an active member.
HasCrCard: This column consists of 2 unique values, yes and no, indicating whether the customer has a credit card.
Exited: This is the target column, indicating whether the customer has exited the bank.
```
  CreditScore        Age            Tenure          Balance
 Min.   :350.0   Min.   :18.00   Min.   : 0.00   Min.   :     0
 1st Qu.:597.0   1st Qu.:32.00   1st Qu.: 3.00   1st Qu.:     0
 Median :659.0   Median :37.00   Median : 5.00   Median :     0
 Mean   :656.5   Mean   :38.13   Mean   : 5.02   Mean   : 55478
 3rd Qu.:710.0   3rd Qu.:42.00   3rd Qu.: 7.00   3rd Qu.:119940
 Max.   :850.0   Max.   :92.00   Max.   :10.00   Max.   :250898
 NumOfProducts   EstimatedSalary
 Min.   :1.000   Min.   :    11.58
 1st Qu.:1.000   1st Qu.: 74637.57
 Median :2.000   Median :117948.00
 Mean   :1.554   Mean   :112574.82
 3rd Qu.:2.000   3rd Qu.:155152.47
 Max.   :4.000   Max.   :199992.48
```
Credit Score: Ranges from 350 to 850, with a median of 659, indicating mid-range creditworthiness among customers.
Age: Customers' ages range from 18 to 92 years, with a median age of 37, suggesting a predominantly middle-aged clientele.
Tenure: Tenure with the bank varies from 0 to 10 years, with a median of 5 years, showing that customers are fairly evenly distributed in terms of loyalty.
Balance: Account balances range up to $250,898, but the median balance is 0, indicating that many customers maintain low or no balances.
Number of Products: Most customers have between 1 and 2 banking products, with a median of 2 products per customer.
Estimated Salary: Salaries vary widely, up to $199,992.48, with a median of $117,948, reflecting a broad spectrum of income levels among the bank’s clientele.
Observations:
The largest concentration of customers falls within the 30 to 40-year-old range, indicating that the majority of customers are in their early to mid-career stages.
There is a significant drop in frequency as age increases, especially beyond 50 years. This suggests that the customer base skews younger.
The distribution is right-skewed, meaning there are fewer older customers (those over 60) compared to younger customers.
There is a small number of customers in the youngest age bracket (under 25 years) and the oldest (over 75 years).
Observations:
The y-axis represents the balance on customer accounts, which seems to range from 0 to a bit over 250,000.
Both boxes have a similar interquartile range (IQR), which is the range between the first quartile (25th percentile) and the third quartile (75th percentile), represented by the height of the boxes. This suggests that the middle 50% of balances are similarly distributed between both groups.
The median, indicated by the line within each box, is roughly at the same level for both groups, suggesting that the central tendency of balance is similar regardless of whether the customer has exited or not.
The boxplot shows no apparent outliers, as there are no data points beyond the whiskers which represent 1.5 times the interquartile range.
Observations:
France has the highest count of customers using one product, followed closely by those using two products. The number of customers using three and four products is significantly lower.
Germany shows a similar pattern to France with one and two products being the most common among customers. However, the count for one product is notably lower than in France, whereas the count for two products is slightly higher.
Spain’s pattern mirrors that of France and Germany, with one product being the most common, followed by two products. Again, three and four products are used by a considerably smaller number of customers.
Observations:
The highest churn rate is in Germany (37.9%), followed by Spain (17.22%), with the lowest in France (16.53%).
Observations:
Both exited and non-exited customers are found across the entire range of Credit Scores and Age, but there is a noticeable density of exited customers (blue dots) in the middle age range, particularly between ages 40 and 50.
Observations:
There seems to be a noticeable positive correlation between Age and Balance, and a negative correlation between NumOfProducts and Balance.
Observations:
We can see there is a significant class imbalance in the data; this is addressed during model building.
## Data Preprocessing

One Hot Encoding: As geography is a categorical column, we performed one-hot encoding to convert it into a numerical column for each value.
Label Encoding: Performed label encoding on the Gender and Exited columns to convert them to numerical columns.
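The two encodings can be sketched as follows; this is an illustrative Python version, not the project's R code, and the `country_*` column names and label mapping are hypothetical:

```python
def one_hot_encode(values, prefix="country"):
    # one indicator column per distinct category
    categories = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in categories} for v in values]

def label_encode(values, mapping):
    # map each category to a fixed integer code
    return [mapping[v] for v in values]

countries = ["France", "Spain", "Germany", "France"]
print(one_hot_encode(countries)[0])
# {'country_France': 1, 'country_Germany': 0, 'country_Spain': 0}
print(label_encode(["Male", "Female"], {"Female": 0, "Male": 1}))  # [1, 0]
```

One-hot encoding avoids imposing a false ordering on the three countries, whereas a binary column such as Gender can safely take a single 0/1 label.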
We split our data into training and testing sets using a 70/30 split to evaluate the performance of our models on unseen data.
To address the class imbalance in our target variable, we employed the ROSE package, which implements a combination of over-sampling and under-sampling techniques. This method effectively created a more balanced distribution of classes, ensuring better representation of both majority and minority groups in our dataset.
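ROSE itself generates synthetic examples via a smoothed bootstrap; as a simplified stand-in, the combined over-/under-sampling idea can be sketched with plain resampling (the class sizes and seed below are illustrative, not the project's values):

```python
import random

def balance_classes(majority, minority, seed=123):
    # under-sample the majority class and over-sample the minority class
    # (with replacement) so both meet at the midpoint of their sizes
    random.seed(seed)
    target = (len(majority) + len(minority)) // 2
    under = random.sample(majority, target)
    over = minority + [random.choice(minority) for _ in range(target - len(minority))]
    return under + over

stayed = [{"churn": 0}] * 800
churned = [{"churn": 1}] * 200
balanced = balance_classes(stayed, churned)
print(len(balanced))  # 1000, with 500 rows per class
```

Unlike this sketch, ROSE perturbs feature values when generating minority examples, which reduces the risk of exact-duplicate rows in the training set.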
Normalization is a crucial preprocessing step that helps in standardizing the data values within a specific range, typically zero mean and unit variance. This is particularly important for algorithms that assume data is normally distributed, or for those that are sensitive to the scale of the input features, such as Support Vector Machines or k-nearest neighbors.
In normalization we have calculated the mean and standard deviation for selected numerical columns in the training data, using the apply function in R to avoid data leakage from the test set. The training data is then standardized by subtracting the mean and dividing by the standard deviation for each column, a method known as Z-score normalization. This same scaling approach is applied to the test set to ensure both datasets are on a comparable scale, facilitating more accurate model training and evaluation.
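The fit-on-train, apply-to-both scaling described above can be sketched as follows (a toy one-column example in Python, not the project's R code):

```python
from statistics import mean, stdev

def fit_scaler(train_cols):
    # compute mean and (sample) standard deviation on training data only,
    # to avoid leaking test-set statistics into the model
    return [(mean(col), stdev(col)) for col in train_cols]

def transform(cols, params):
    # z-score: subtract the training mean, divide by the training sd
    return [[(v - m) / s for v in col] for col, (m, s) in zip(cols, params)]

train = [[600.0, 650.0, 700.0]]   # e.g. a CreditScore column
test = [[650.0]]
params = fit_scaler(train)
print(transform(train, params))   # [[-1.0, 0.0, 1.0]]
print(transform(test, params))    # [[0.0]]
```

Because the test column is transformed with the training mean and standard deviation, both datasets end up on the same scale without any leakage.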
In this modeling phase, two different machine learning models are trained for a classification task using the caret package in R. The models include Support Vector Machine (SVM) with a radial basis function kernel and the Random Forest model.
The target variable Exited is converted to a factor to ensure that it is treated as a categorical variable for classification. A consistent seed (set.seed(123)) is set before training each model to ensure reproducibility of the results.
The trainControl function is used to set up the training control parameters, specifying 5-fold cross-validation (method = "cv") to assess the performance of the models.
Cross-validation is a statistical technique used to evaluate the performance and stability of machine learning models. In this approach, the data set is randomly partitioned into \(K\) equal or nearly equal sized sub-datasets or "folds." The model is then trained and tested \(K\) times, with each fold used exactly once as the validation data and the remaining \(K-1\) folds used for training (here, \(K = 5\)).
Benefits of K-Fold Cross-Validation
Reduces Overfitting: By using different subsets of the data for training and validation, K-fold cross-validation reduces the risk of the model overfitting to a specific portion of the data.
Improves Model Generalizability: Since the model is validated multiple times against different subsets of data, it ensures that the model performs well across various samples of the data, not just on the data it was trained on. This helps in assessing the model’s ability to generalize to new, unseen data.
Efficient Use of Data: Unlike a simple train/test split, cross-validation allows every observation in the dataset to be used for both training and validation. This is especially beneficial when dealing with limited data resources, as it maximizes the amount of training data available.
Reliable Performance Estimation: Each data point gets to be in a test set exactly once and in a training set four times. This comprehensive involvement ensures that the performance metric you compute over the folds is more reliable and robust, as it incorporates a wider range of scenarios.
Minimizes Bias: The random shuffling and partitioning of data into folds help minimize bias associated with the order or any potential patterns in the data collection process. This randomization helps ensure that the validation process is as impartial as possible.
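The fold construction behind 5-fold cross-validation can be sketched as follows (illustrative Python, not the caret internals; the dataset size and seed are arbitrary):

```python
import random

def kfold_indices(n, k=5, seed=123):
    # shuffle the row indices, then deal them into k nearly equal folds;
    # each index lands in exactly one fold
    random.seed(seed)
    idx = list(range(n))
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(10, k=5)
for i, fold in enumerate(folds):
    # fold i is the validation set; the other k-1 folds form the training set
    train = [j for f in folds if f is not fold for j in f]
    print(f"fold {i}: {len(train)} train rows, {len(fold)} validation rows")
```

The loop makes the bookkeeping concrete: every row appears in the validation set exactly once and in the training set \(K-1\) times, which is what makes the averaged performance estimate reliable.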
Data Preprocessing: Centering and scaling are crucial for SVM performance because it ensures that all features contribute equally to the distance calculations in the feature space.
Model Overview: SVM is a powerful classification technique that works by finding a hyperplane in an N-dimensional space (N — the number of features) that distinctly classifies the data points. For non-linearly separable data, SVM uses a kernel trick to transform the data into a higher-dimensional space where a hyperplane can be used for separation.
Radial Basis Kernel Function: The SVM model uses a Radial Basis Function (RBF) kernel to handle non-linear separation between classes. This kernel function is defined as
\[ K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2} \] Here, \(x_i\) and \(x_j\) are two feature vectors in the input space, \(\gamma\) is a parameter that defines the spread of the kernel, and \(\|x_i - x_j\|^2\) is the squared Euclidean distance between the two feature vectors.
After incorporating our data columns, the SVM's RBF kernel equation looks like below.
\[ K(x_i, x_j) = e^{-\gamma \left( (CreditScore_i - CreditScore_j)^2 + (Age_i - Age_j)^2 + (Balance_i - Balance_j)^2 + (ProductsNumber_i - ProductsNumber_j)^2 + (EstimatedSalary_i - EstimatedSalary_j)^2 + \ldots \right)} \]
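As a sketch of how this expands over actual rows, the Gram (kernel) matrix an RBF-kernel SVM builds over customer records looks like the following; the standardized feature values and \(\gamma = 0.1\) are invented for illustration:

```python
import math

def rbf(xi, xj, gamma=0.1):
    # exp(-gamma * sum over features of squared per-feature differences)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj)))

# illustrative standardized rows:
# (CreditScore, Age, Balance, ProductsNumber, EstimatedSalary)
customers = [
    [ 0.2, -1.1, 0.0,  0.5, 1.3],
    [-0.4,  0.9, 1.2, -0.5, 0.0],
    [ 0.2, -1.1, 0.0,  0.5, 1.3],
]

# the Gram matrix a kernel SVM works with internally
K = [[rbf(a, b) for b in customers] for a in customers]
print(K[0][0])  # 1.0 -- every customer is maximally similar to itself
print(K[0][2])  # 1.0 -- rows 0 and 2 are identical customers
```

This also shows why the standardization step matters: without it, the Balance and EstimatedSalary differences would dwarf the other terms inside the exponent.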
The trained models are evaluated on the test set, and their performances are compared.
Accuracy: The percentage of total customers correctly predicted as churned or not churned by the model.
Sensitivity (Recall): The proportion of actual churned customers that the model correctly identifies (True Positives).
Specificity: The proportion of customers who have not churned that the model correctly predicts as not exiting (True Negatives).
Kappa: The measure of agreement between the churn predictions and actual churn instances, corrected for chance agreement.
Positive Predictive Value (Precision): The proportion of customers predicted to churn that actually churned (True Positives among all positive predictions).
Negative Predictive Value: The likelihood that a customer predicted to not churn by the model has indeed not churned.
Balanced Accuracy: An average of the model’s ability to correctly identify both churned and retained customers, crucial for datasets where churn events are less common.
AUC-ROC: The probability that the model ranks a randomly chosen churned customer higher than a randomly chosen customer who hasn’t churned, indicating how well the model distinguishes between the two groups.
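All of these metrics except AUC-ROC can be derived from the four confusion-matrix cells; here is a small Python sketch with made-up counts (not the project's results):

```python
def churn_metrics(tp, fp, tn, fn):
    # tp: correctly flagged churners; tn: correctly identified loyal customers
    sens = tp / (tp + fn)   # recall on churned customers
    spec = tn / (tn + fp)   # recall on retained customers
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),   # precision
        "npv": tn / (tn + fn),
        "balanced_accuracy": (sens + spec) / 2,
    }

m = churn_metrics(tp=80, fp=10, tn=90, fn=20)  # hypothetical counts
print(m["accuracy"])           # 0.85
print(m["balanced_accuracy"])  # 0.85
```

Balanced accuracy averages the two per-class recalls, which is why it is the more informative summary when churners are the minority class.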
| Metric | SVM | Random Forest |
|---|---|---|
| Accuracy | 79.52% | 83.38% |
| Kappa | 0.494 | 0.5375 |
| Sensitivity | 79.11% | 86.48% |
| Specificity | 81.09% | 71.70% |
| Positive Predictive Value | 94.02% | 91.99% |
| Negative Predictive Value | 50.82% | 58.54% |
| Balanced Accuracy | 80.10% | 79.09% |
| AUC-ROC | 0.801 | 0.791 |
In this analysis of bank customer churn prediction, the Support Vector Machine (SVM) model has shown promising results, particularly in terms of specificity (81.09%) and positive predictive value (94.02%). These metrics are crucial in the banking context, as they indicate the model's accuracy in correctly identifying loyal customers (specificity) and its reliability in flagging potential churners (positive predictive value). Additionally, the SVM model's AUC-ROC score of 0.801, the higher of the two models, underscores its effectiveness in distinguishing between customers who are likely to churn and those who are not across various decision thresholds.
Although the Random Forest model exhibited the highest overall accuracy (83.38%) and kappa score (0.5375), its lower specificity and negative predictive value compared to the SVM model suggest it may produce more false positives, leading to misallocated retention efforts.